The coursework involves an individual analysis of two datasets: crime23.csv and temp2023.csv. These datasets contain street-level crime incidents and daily climate data for Colchester during 2023. The crime data was extracted using an interface that provides detailed descriptions of the variables, accessible at https://ukpolice.njtierney.com/reference/ukp_crime.html. Similarly, the climate data was collected from a weather station near Colchester, with variable descriptions and the extraction interface available at https://bczernecki.github.io/climate/reference/meteo_ogimet.html.
The objective is to explore patterns, trends and relationships within the crime and climate data, and to gain insight into the factors that may influence crime occurrences.
The analysis will include descriptive statistics, data visualisation, and correlation analysis to uncover patterns and relations within the datasets. Advanced graphics and interactive plots will be utilised to enhance the presentation of the findings. The project will be carried out using R Markdown, providing a detailed interpretation of the results.
# Load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(leaflet)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
##
## Attaching package: 'ggmap'
##
##
## The following object is masked from 'package:plotly':
##
## wind
# Load the dataset
colc_crime_data <- read_csv("crime23.csv")
## Rows: 6878 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): category, persistent_id, date, street_name, location_type, location...
## dbl (4): lat, long, street_id, id
## lgl (1): context
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the structure of the data frame
str(colc_crime_data)
## spc_tbl_ [6,878 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ category : chr [1:6878] "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr [1:6878] NA NA NA NA ...
## $ date : chr [1:6878] "2023-01" "2023-01" "2023-01" "2023-01" ...
## $ lat : num [1:6878] 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num [1:6878] 0.909 0.902 0.898 0.902 0.895 ...
## $ street_id : num [1:6878] 2153366 2153173 2153077 2153186 2153012 ...
## $ street_name : chr [1:6878] "On or near Military Road" "On or near" "On or near Culver Street West" "On or near Ryegate Road" ...
## $ context : logi [1:6878] NA NA NA NA NA NA ...
## $ id : num [1:6878] 1.08e+08 1.08e+08 1.08e+08 1.08e+08 1.08e+08 ...
## $ location_type : chr [1:6878] "Force" "Force" "Force" "Force" ...
## $ location_subtype: chr [1:6878] NA NA NA NA ...
## $ outcome_status : chr [1:6878] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. category = col_character(),
## .. persistent_id = col_character(),
## .. date = col_character(),
## .. lat = col_double(),
## .. long = col_double(),
## .. street_id = col_double(),
## .. street_name = col_character(),
## .. context = col_logical(),
## .. id = col_double(),
## .. location_type = col_character(),
## .. location_subtype = col_character(),
## .. outcome_status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The dataset has 6,878 observations and 12 attributes. Here’s a look into the columns:
category is a character vector indicating the category of each crime reported.
persistent_id is a character vector representing the persistent ID for each crime. It contains several missing values (NA).
date is a character vector representing the date of each crime reported, formatted as YYYY-MM.
lat is a numeric vector representing the latitude of each crime location.
long is a numeric vector representing the longitude of each crime location.
street_id is a numeric vector representing the unique identifier for the street where each crime occurred.
street_name is a character vector representing the name of the location where each crime occurred. Some entries contain only “On or near” with no street name.
context is a logical vector indicating if there is any additional context for each crime. Every value shown in the preview is missing (NA).
id is a numeric vector representing the ID of each crime. It is likely to be a unique identifier.
location_type is a character vector representing the type of location where each crime was recorded (e.g., “Force” and “BPT”).
location_subtype is a character vector representing the subtype of location for each crime. It contains some missing values.
outcome_status is a character vector representing the outcome status of each crime. It also contains missing values.
# Check for missing values in the dataset
sum(is.na(colc_crime_data))
## [1] 15110
There are 15,110 missing values across the dataset.
# Calculate the number of missing values for each column
missing_val <- colSums(is.na(colc_crime_data))
# Filter columns with missing values
missing_col <- missing_val[missing_val > 0]
# Create a data frame with columns and their respective missing value counts
missing_col_with_val <- data.frame(Column = names(missing_col), Missing_Val = missing_col, row.names = NULL)
# Display the missing values with their columns
missing_col_with_val
## Column Missing_Val
## 1 persistent_id 701
## 2 context 6878
## 3 location_subtype 6854
## 4 outcome_status 677
The results show that “persistent_id” has 701 missing values, “context” is missing in all 6,878 rows, “location_subtype” has 6,854 missing values, and “outcome_status” has 677 missing values.
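To put these counts in proportion, the sketch below re-derives the share of missing values per column from the figures reported above. The vector `missing_counts` is typed in by hand for illustration (it is not read from `crime23.csv`), and the variable names are assumptions.

```r
# Counts copied from the output above; not read from the CSV
n_rows <- 6878
missing_counts <- c(persistent_id = 701, context = 6878,
                    location_subtype = 6854, outcome_status = 677)

# Percentage of rows with a missing value, per column
missing_pct <- round(100 * missing_counts / n_rows, 1)
missing_pct

# The per-column counts should add up to the overall total of 15,110
sum(missing_counts)
```

Roughly one row in ten lacks a `persistent_id` or an `outcome_status`, whereas `context` and `location_subtype` are almost entirely empty.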
# Handle missing data
# Drop the "context" and "location_subtype" columns
colc_crime_data <- colc_crime_data[, !(names(colc_crime_data) %in% c("context", "location_subtype"))]
# Replace missing values in persistent_id and outcome_status with "Unknown"
colc_crime_data <- colc_crime_data %>%
  mutate(persistent_id = ifelse(is.na(persistent_id), "Unknown", persistent_id),
         outcome_status = ifelse(is.na(outcome_status), "Unknown", outcome_status))
# confirm the change
head(colc_crime_data)
## # A tibble: 6 × 10
## category persistent_id date lat long street_id street_name id
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 anti-social-beha… Unknown 2023… 51.9 0.909 2153366 On or near… 1.08e8
## 2 anti-social-beha… Unknown 2023… 51.9 0.902 2153173 On or near 1.08e8
## 3 anti-social-beha… Unknown 2023… 51.9 0.898 2153077 On or near… 1.08e8
## 4 anti-social-beha… Unknown 2023… 51.9 0.902 2153186 On or near… 1.08e8
## 5 anti-social-beha… Unknown 2023… 51.9 0.895 2153012 On or near… 1.08e8
## 6 anti-social-beha… Unknown 2023… 51.9 0.909 2153379 On or near… 1.08e8
## # ℹ 2 more variables: location_type <chr>, outcome_status <chr>
In handling the missing data, the “context” and “location_subtype” columns were dropped because they are virtually empty and provide no helpful information for the analysis. Missing values in “persistent_id” and “outcome_status” were replaced with “Unknown”, which keeps all 6,878 rows available for analysis and visualisation while making the absence of a value explicit.
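The same two steps can also be expressed as a rule rather than by naming columns. The following is a minimal base R sketch on a toy data frame; the 95% threshold and the object names (`toy`, `na_share`, `toy_clean`) are assumptions for illustration, not part of the analysis above.

```r
# Toy data frame standing in for the crime data (values are illustrative)
toy <- data.frame(
  category       = c("drugs", "burglary", "anti-social-behaviour"),
  context        = c(NA, NA, NA),
  outcome_status = c("Under investigation", NA, NA),
  stringsAsFactors = FALSE
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep only columns below an assumed 95% missing threshold
toy_clean <- toy[, na_share < 0.95, drop = FALSE]

# Replace the remaining NAs in character columns with "Unknown"
is_chr <- vapply(toy_clean, is.character, logical(1))
toy_clean[is_chr] <- lapply(toy_clean[is_chr], function(x) {
  x[is.na(x)] <- "Unknown"
  x
})
toy_clean
```

A threshold makes the decision reproducible if the data is re-extracted, although the cut-off itself remains a judgement call.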
# Check for duplicates
duplicate_rows <- colc_crime_data[duplicated(colc_crime_data),]
print(duplicate_rows)
## # A tibble: 0 × 10
## # ℹ 10 variables: category <chr>, persistent_id <chr>, date <chr>, lat <dbl>,
## # long <dbl>, street_id <dbl>, street_name <chr>, id <dbl>,
## # location_type <chr>, outcome_status <chr>
The result above shows no duplicate rows in the dataset; hence, there is no need to remove duplicates.
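Note that `duplicated()` on a whole data frame only flags rows that match in every column. Since `id` is described above as a likely unique identifier, a complementary check on that column alone may be useful. The sketch below uses a toy data frame with made-up values; it is an illustration, not output from the crime data.

```r
# Toy data: no fully identical rows, but one id appears twice
toy <- data.frame(
  id       = c(101, 102, 103, 103),
  category = c("drugs", "burglary", "robbery", "shoplifting"),
  stringsAsFactors = FALSE
)

# Row-level check, as used above: no row is an exact copy of another
any(duplicated(toy))

# Column-level check: ids that occur more than once
dup_ids <- toy$id[duplicated(toy$id)]
dup_ids
```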
# Load library
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Convert date column to year-month format
colc_crime_data$date <- as.yearmon(colc_crime_data$date)
# Convert date column to Date format
colc_crime_data$date <- as.Date(colc_crime_data$date)
# Confirm the new date format
str(colc_crime_data)
## tibble [6,878 × 10] (S3: tbl_df/tbl/data.frame)
## $ category : chr [1:6878] "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr [1:6878] "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ date : Date[1:6878], format: "2023-01-01" "2023-01-01" ...
## $ lat : num [1:6878] 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num [1:6878] 0.909 0.902 0.898 0.902 0.895 ...
## $ street_id : num [1:6878] 2153366 2153173 2153077 2153186 2153012 ...
## $ street_name : chr [1:6878] "On or near Military Road" "On or near" "On or near Culver Street West" "On or near Ryegate Road" ...
## $ id : num [1:6878] 1.08e+08 1.08e+08 1.08e+08 1.08e+08 1.08e+08 ...
## $ location_type : chr [1:6878] "Force" "Force" "Force" "Force" ...
## $ outcome_status: chr [1:6878] "Unknown" "Unknown" "Unknown" "Unknown" ...
The “date” column is stored as character in YYYY-MM format, which has no day component and so does not convert directly to a Date; hence, the ‘as.yearmon()’ function from the ‘zoo’ package was used to convert it to a year-month format. Then, the as.Date function was used to convert it to Date format, which sets each value to the first day of its month.
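For comparison, the same result can be reached in base R by appending a day before parsing. This is a minimal sketch on a hand-typed vector (`ym` is a hypothetical name), shown as an alternative rather than as the method used in this report.

```r
# Year-month strings in the same YYYY-MM format as the crime data
ym <- c("2023-01", "2023-02", "2023-12")

# Append "-01" so each string is a complete date, then parse it
dates <- as.Date(paste0(ym, "-01"), format = "%Y-%m-%d")
dates

# The month label can be recovered for display when needed
format(dates, "%Y-%m")
```

Either route represents each month by its first day, so the day component carries no information about when within the month an incident occurred.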
# Check unique values in categorical columns
unique_val <- sapply(colc_crime_data[, sapply(colc_crime_data, is.character)],
function(x) length(unique(x)))
# print unique values
unique_val
## category persistent_id street_name location_type outcome_status
## 14 6177 351 2 14
The dataset comprises several categorical columns with distinct characteristics. The “category” column has 14 unique values representing different types of reported crimes. The “persistent_id” column has 6,177 unique values (one of which is the “Unknown” placeholder), which enables tracking of individual occurrences where an identifier was recorded. The data spans 12 months, recorded in the “date” column, which is not counted above because it is now stored as a Date rather than as character. The “street_name” column contains 351 unique street names, showing that crimes occur across a wide range of locations. The “location_type” column contains 2 unique types. The “outcome_status” column contains 14 distinct statuses (including “Unknown”) indicating the resolution of each case.
# create a two-way table for category and outcome status
two_way_table <- table(colc_crime_data$category, colc_crime_data$outcome_status)
two_way_table
##
## Action to be taken by another organisation
## anti-social-behaviour 0
## bicycle-theft 0
## burglary 0
## criminal-damage-arson 5
## drugs 2
## other-crime 1
## other-theft 1
## possession-of-weapons 1
## public-order 6
## robbery 0
## shoplifting 5
## theft-from-the-person 0
## vehicle-crime 0
## violent-crime 83
##
## Awaiting court outcome Court result unavailable
## anti-social-behaviour 0 0
## bicycle-theft 0 1
## burglary 1 15
## criminal-damage-arson 31 22
## drugs 18 17
## other-crime 6 3
## other-theft 2 6
## possession-of-weapons 9 10
## public-order 18 17
## robbery 4 4
## shoplifting 51 45
## theft-from-the-person 0 0
## vehicle-crime 6 2
## violent-crime 114 64
##
## Formal action is not in the public interest
## anti-social-behaviour 0
## bicycle-theft 0
## burglary 0
## criminal-damage-arson 0
## drugs 1
## other-crime 0
## other-theft 1
## possession-of-weapons 0
## public-order 2
## robbery 0
## shoplifting 1
## theft-from-the-person 0
## vehicle-crime 0
## violent-crime 4
##
## Further action is not in the public interest
## anti-social-behaviour 0
## bicycle-theft 0
## burglary 2
## criminal-damage-arson 2
## drugs 10
## other-crime 6
## other-theft 1
## possession-of-weapons 1
## public-order 12
## robbery 0
## shoplifting 11
## theft-from-the-person 0
## vehicle-crime 0
## violent-crime 37
##
## Further investigation is not in the public interest
## anti-social-behaviour 0
## bicycle-theft 0
## burglary 0
## criminal-damage-arson 0
## drugs 0
## other-crime 8
## other-theft 0
## possession-of-weapons 0
## public-order 0
## robbery 0
## shoplifting 0
## theft-from-the-person 0
## vehicle-crime 0
## violent-crime 0
##
## Investigation complete; no suspect identified
## anti-social-behaviour 0
## bicycle-theft 216
## burglary 154
## criminal-damage-arson 363
## drugs 15
## other-crime 13
## other-theft 350
## possession-of-weapons 9
## public-order 193
## robbery 38
## shoplifting 299
## theft-from-the-person 61
## vehicle-crime 350
## violent-crime 595
##
## Local resolution Offender given a caution
## anti-social-behaviour 0 0
## bicycle-theft 1 0
## burglary 0 0
## criminal-damage-arson 14 4
## drugs 98 6
## other-crime 0 2
## other-theft 1 2
## possession-of-weapons 11 6
## public-order 7 1
## robbery 1 0
## shoplifting 34 5
## theft-from-the-person 0 0
## vehicle-crime 1 0
## violent-crime 71 35
##
## Status update unavailable
## anti-social-behaviour 0
## bicycle-theft 8
## burglary 13
## criminal-damage-arson 6
## drugs 9
## other-crime 10
## other-theft 12
## possession-of-weapons 5
## public-order 14
## robbery 5
## shoplifting 4
## theft-from-the-person 2
## vehicle-crime 5
## violent-crime 84
##
## Suspect charged as part of another case
## anti-social-behaviour 0
## bicycle-theft 0
## burglary 0
## criminal-damage-arson 0
## drugs 0
## other-crime 0
## other-theft 0
## possession-of-weapons 0
## public-order 0
## robbery 0
## shoplifting 1
## theft-from-the-person 0
## vehicle-crime 0
## violent-crime 0
##
## Unable to prosecute suspect Under investigation Unknown
## anti-social-behaviour 0 0 677
## bicycle-theft 8 1 0
## burglary 20 20 0
## criminal-damage-arson 117 17 0
## drugs 13 19 0
## other-crime 34 9 0
## other-theft 94 21 0
## possession-of-weapons 12 10 0
## public-order 218 44 0
## robbery 35 7 0
## shoplifting 76 22 0
## theft-from-the-person 10 3 0
## vehicle-crime 23 19 0
## violent-crime 1299 247 0
The two-way table above represents a cross-tabulation of the counts of crime incidents based on their category and outcome status.
It shows how many incidents fall into each outcome status for each crime category. For example, in the “Investigation complete; no suspect identified” outcome status, high counts were observed for bicycle-theft (216), burglary (154), criminal-damage-arson (363), other-theft (350), shoplifting (299), vehicle-crime (350), and violent-crime (595). The “Unknown” outcome status applies only to anti-social-behaviour: all 677 of its incidents have no recorded outcome, and no other category has an “Unknown” status.
It can be deduced that different types of crime tend to have different resolution statuses. For example, most bicycle-theft incidents (216 of 235) and more than half of shoplifting incidents (299 of 554) ended as “Investigation complete; no suspect identified”, suggesting that identifying suspects might be more challenging for these types of crimes. Crimes categorised as “violent-crime” have relatively high counts across most outcome statuses, which largely reflects the volume of these incidents (2,633 in total); the most common outcome for them is “Unable to prosecute suspect” (1,299).
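Raw counts favour the largest categories, so row proportions give a fairer comparison between crime types. The sketch below applies `prop.table()` to three rows of the table, with all other outcomes collapsed into a single column. The counts are copied from the output above, and the matrix layout and names are assumptions made for illustration.

```r
# Counts from the two-way table: "no suspect identified" vs all other outcomes
counts <- matrix(
  c(216,   19,
    299,  255,
    595, 2038),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    category = c("bicycle-theft", "shoplifting", "violent-crime"),
    outcome  = c("no_suspect_identified", "other_outcomes")
  )
)

# Row percentages: share of each category's incidents in each outcome group
row_share <- round(100 * prop.table(counts, margin = 1), 1)
row_share
```

On the full table, `prop.table(two_way_table, margin = 1)` would give the same kind of row shares for every category and outcome.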
Understanding the distribution of outcome statuses for different crime categories can help law enforcement agencies allocate resources effectively and prioritise investigations based on the likelihood of resolution. It can also inform decision-making and resource allocation strategies for crime prevention and law enforcement efforts.
# Explore the distribution of crime by category using pie chart
# Load libraries
library(ggplot2)
library(plotly)
# Group by category and calculate frequencies
crime_by_category <- colc_crime_data %>%
  group_by(category) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency))
# Calculate percentage
crime_by_category$percentage <- crime_by_category$frequency / sum(crime_by_category$frequency) * 100
print(crime_by_category)
## # A tibble: 14 × 3
## category frequency percentage
## <chr> <int> <dbl>
## 1 violent-crime 2633 38.3
## 2 anti-social-behaviour 677 9.84
## 3 criminal-damage-arson 581 8.45
## 4 shoplifting 554 8.05
## 5 public-order 532 7.73
## 6 other-theft 491 7.14
## 7 vehicle-crime 406 5.90
## 8 bicycle-theft 235 3.42
## 9 burglary 225 3.27
## 10 drugs 208 3.02
## 11 robbery 94 1.37
## 12 other-crime 92 1.34
## 13 theft-from-the-person 76 1.10
## 14 possession-of-weapons 74 1.08
# Create a pie chart
pie_chart <- plot_ly(crime_by_category, labels = ~category, values = ~frequency, type = 'pie',
                     textinfo = 'label+percent', insidetextfont = list(color = '#FFFFFF', size = 10)) %>%
  layout(title = "Distribution of Crimes by Category")
# Show the pie chart
pie_chart
The pie chart (using interactive plot) represents the distribution of crime by category in Colchester in 2023. Hovering the mouse over the plot will show the frequency and percentage of the crimes.
The violent-crime (38.28%) represents the most prevalent category in the dataset with 2633 reported incidents. This high frequency suggests a significant concern for public safety and underscores the need for interventions to address instances of violence within the community. While still substantial, anti-social behaviour (9.84%) accounts for a smaller proportion of reported incidents compared to violent crime. However, with 677 reported cases, it remains a notable issue that may contribute to community disruption and discomfort. The presence of 581 reported incidents of criminal damage/arson (8.45%) highlights concerns regarding property-related offences and deliberate acts of vandalism. Addressing these incidents is essential for preserving public and private property.
With 554 reported cases, shoplifting (8.05%) constitutes a significant portion of reported crimes, indicating instances of theft occurring in commercial establishments. This may have economic implications for businesses and consumers alike. Public-order offences (7.73%), with 532 reported incidents, signify challenges related to maintaining order and preventing disturbances within the community. Addressing public order issues is crucial for ensuring a safe and peaceful environment for residents. The other-theft (7.14%) category includes thefts not classified elsewhere, with 491 reported incidents. The diversity of theft-related offences underscores the need for comprehensive strategies to combat various forms of theft.
The presence of 406 reported incidents of vehicle-crime (5.90%) suggests concerns regarding thefts or vandalism involving vehicles. Protecting vehicles and preventing such offences is essential for vehicle owners and the community at large. While representing a smaller proportion of reported crimes, the 235 reported incidents of bicycle theft (3.42%) indicate instances of theft targeting bicycles. Addressing bicycle theft can contribute to promoting cycling as a sustainable mode of transportation. With 225 reported incidents, burglary (3.27%) involves unlawful entry into buildings intending to commit theft or other crimes. Preventing burglaries is crucial for safeguarding residential and commercial properties. The presence of 208 reported incidents of drug-related offences (3.02%) highlights concerns regarding the possession, distribution, or trafficking of controlled substances. Addressing drug-related activities is essential for combating substance abuse and associated criminal behaviours.
With 94 reported incidents, robbery (1.37%) involves theft or attempted theft that includes force or the threat of force against individuals. While constituting a relatively small proportion of reported crimes, instances of robbery are concerning due to their potentially violent nature. The other-crime category (1.34%), with 92 reported incidents, encompasses crimes that do not fall into the preceding categories. It reflects a diverse range of offences not specifically categorised elsewhere; this highlights the complexity of criminal activities within the community. With 76 reported incidents, theft-from-the-person (1.10%) involves direct theft from individuals, such as pickpocketing or purse snatching. While representing a smaller portion of reported crimes, it underscores individuals’ vulnerability to targeted thefts. Finally, 74 reported incidents of possession of weapons (1.08%) signifies concerns regarding the unlawful possession of weapons or firearms within the community. Addressing weapons-related offences is essential for maintaining public safety and preventing potential harm.
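The percentages discussed above follow directly from the frequency column, and recomputing them is a quick way to check quoted figures against the table. The sketch below uses the frequencies copied from the output; the vector name `freq` is an assumption, and the values are typed in by hand rather than derived from `colc_crime_data`.

```r
# Frequencies copied from the crime_by_category output above
freq <- c("violent-crime" = 2633, "anti-social-behaviour" = 677,
          "criminal-damage-arson" = 581, "shoplifting" = 554,
          "public-order" = 532, "other-theft" = 491,
          "vehicle-crime" = 406, "bicycle-theft" = 235,
          "burglary" = 225, "drugs" = 208, "robbery" = 94,
          "other-crime" = 92, "theft-from-the-person" = 76,
          "possession-of-weapons" = 74)

# Total number of incidents; should match the 6,878 rows in the dataset
sum(freq)

# Percentage share of each category, rounded to two decimals
pct <- round(100 * freq / sum(freq), 2)
pct[c("violent-crime", "burglary")]
```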
# Visualise the distribution of crime resolution status (outcome status)
# Count the number of occurrences for each outcome status
outcome_counts <- colc_crime_data %>%
  count(outcome_status, sort = TRUE)
outcome_counts
## # A tibble: 14 × 2
## outcome_status n
## <chr> <int>
## 1 Investigation complete; no suspect identified 2656
## 2 Unable to prosecute suspect 1959
## 3 Unknown 677
## 4 Under investigation 439
## 5 Awaiting court outcome 260
## 6 Local resolution 239
## 7 Court result unavailable 206
## 8 Status update unavailable 177
## 9 Action to be taken by another organisation 104
## 10 Further action is not in the public interest 82
## 11 Offender given a caution 61
## 12 Formal action is not in the public interest 9
## 13 Further investigation is not in the public interest 8
## 14 Suspect charged as part of another case 1
# Define custom colors
custom_colors <- c("chartreuse4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "dodgerblue4", "deeppink3", "slategray3", "#bcbd22", "#17becf", "green2", "coral1", "cadetblue1", "royalblue1")
# Plotting the distribution of outcome status with custom colors
outcome_plot <- ggplot(outcome_counts, aes(x = reorder(outcome_status, n), y = n, fill = outcome_status, text = paste("Outcome Status:", outcome_status, "<br>Number of Incidents:", n))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = custom_colors[1:length(unique(colc_crime_data$outcome_status))]) +
  labs(title = "Distribution of Crime Outcome Status in Colchester (2023)",
       x = "Outcome Status",
       y = "Number of Incidents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 2, size = 10),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
# Convert ggplot to plotly object and enable hover
outcome_plotly <- ggplotly(outcome_plot, tooltip = c("text"))
# Display the plot
outcome_plotly
The interactive bar plot above gives insights into how reported crimes were resolved in Colchester in 2023. Hovering over each bar shows the count for that outcome.
The largest group of incidents (2,656) had investigations completed with no suspect identified. The next largest group (1,959) comprised incidents where the suspect could not be prosecuted. Notably, there were 677 cases with unknown outcomes, indicating gaps in data availability. Additionally, ongoing investigations accounted for 439 incidents, while 260 awaited court outcomes. A further 239 incidents were resolved locally, without formal legal proceedings. However, information on court results and status updates was unavailable for 206 and 177 incidents, respectively.
Furthermore, 104 cases required action from other organisations, suggesting collaboration efforts. Further action was deemed not to be in the public interest in 82 instances, formal action in 9 cases, and further investigation in 8 cases. Lastly, only one incident involved a suspect being charged as part of another case, hinting at potential interrelated criminal activities.
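To see how concentrated the outcomes are, the counts can be turned into shares and cumulative shares. The sketch below uses the counts copied from the `outcome_counts` output above; the vector name `outcome_n` is an assumption, and the values are entered by hand for illustration.

```r
# Counts copied from the outcome_counts output, already sorted descending
outcome_n <- c(
  "Investigation complete; no suspect identified" = 2656,
  "Unable to prosecute suspect" = 1959,
  "Unknown" = 677,
  "Under investigation" = 439,
  "Awaiting court outcome" = 260,
  "Local resolution" = 239,
  "Court result unavailable" = 206,
  "Status update unavailable" = 177,
  "Action to be taken by another organisation" = 104,
  "Further action is not in the public interest" = 82,
  "Offender given a caution" = 61,
  "Formal action is not in the public interest" = 9,
  "Further investigation is not in the public interest" = 8,
  "Suspect charged as part of another case" = 1
)

# Percentage share of each outcome and the running total
share <- 100 * outcome_n / sum(outcome_n)
cum_share <- cumsum(share)

# Top three outcomes with their share and cumulative share
round(head(cbind(share, cum_share), 3), 1)
```

The two most common outcomes together cover about two thirds of all incidents, while no single outcome reaches half.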
# Crime Time Series Map
library(plotly)
# Convert date column to character format
colc_crime_data$date <- as.character(colc_crime_data$date)
# Create a time series map with advanced layers
colc_crime_map <- plot_ly(data = colc_crime_data, type = "scattermapbox", mode = "markers") %>%
  add_trace(lat = ~lat, lon = ~long, color = ~category, colors = "Set1", size = 5,
            text = ~paste("Category: ", category, "<br>Date: ", date),
            hoverinfo = "text",
            frame = ~date,
            frame_style = list(title = "Date: %{frame}")) %>%
  layout(title = "Crime Time Series Map for Colchester (2023)",
         mapbox = list(style = "carto-positron",
                       zoom = 10,
                       center = list(lon = mean(colc_crime_data$long), lat = mean(colc_crime_data$lat))),
         xaxis = list(title = "Longitude"),
         yaxis = list(title = "Latitude"),
         legend = list(title = "Category"),
         updatemenus = list(list(
           buttons = list(
             list(
               args = list(frame = list(duration = 1000, redraw = TRUE),
                           fromcurrent = TRUE),
               label = "Play",
               method = "animate"
             ),
             list(
               args = list(frame = list(duration = 0, redraw = TRUE),
                           mode = "immediate"),
               label = "Pause",
               method = "animate"
             )
           ),
           direction = "left",
           pad = list(r = 10, t = 87),
           showactive = FALSE,
           type = "buttons",
           x = 0.1,
           xanchor = "right",
           y = 0,
           yanchor = "top"
         ))
  )
# Display the time series map
colc_crime_map
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning: 'scattermapbox' objects don't have these attributes: 'frame_style'
## Valid attributes include:
## 'below', 'connectgaps', 'customdata', 'customdatasrc', 'fill', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'lat', 'latsrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lon', 'lonsrc', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'selected', 'selectedpoints', 'showlegend', 'stream', 'subplot', 'text', 'textfont', 'textposition', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'scattermapbox' objects don't have these attributes: 'frame_style'
## Valid attributes include:
## 'below', 'connectgaps', 'customdata', 'customdatasrc', 'fill', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'lat', 'latsrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lon', 'lonsrc', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'selected', 'selectedpoints', 'showlegend', 'stream', 'subplot', 'text', 'textfont', 'textposition', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'scattermapbox' objects don't have these attributes: 'frame_style'
## Valid attributes include:
## 'below', 'connectgaps', 'customdata', 'customdatasrc', 'fill', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'lat', 'latsrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lon', 'lonsrc', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'selected', 'selectedpoints', 'showlegend', 'stream', 'subplot', 'text', 'textfont', 'textposition', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'scattermapbox' objects don't have these attributes: 'frame_style'
## Valid attributes include:
## 'below', 'connectgaps', 'customdata', 'customdatasrc', 'fill', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'lat', 'latsrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lon', 'lonsrc', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'selected', 'selectedpoints', 'showlegend', 'stream', 'subplot', 'text', 'textfont', 'textposition', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
## Warning: 'scattermapbox' objects don't have these attributes: 'frame_style'
## Valid attributes include:
## 'below', 'connectgaps', 'customdata', 'customdatasrc', 'fill', 'fillcolor', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'lat', 'latsrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'lon', 'lonsrc', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'selected', 'selectedpoints', 'showlegend', 'stream', 'subplot', 'text', 'textfont', 'textposition', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
The plot above is an interactive map of crime incidents in Colchester during 2023, drawn to explore how crime changes over time. The map uses the carto-positron style, a light-coloured base map that emphasises streets and labels. The zoom level and centre are set so that the whole of Colchester is visible. Each marker represents a crime incident, and its colour corresponds to the crime category (see the legend on the right).
Hovering over a marker shows additional information: “category”, the type of crime for that incident (e.g., anti-social-behaviour, theft-from-the-person), and “Date”, the date on which the crime occurred. A time slider with play and pause buttons in the bottom-left corner animates the map. After clicking “Play”, the markers appear sequentially, representing the crime incidents recorded throughout 2023. This animation helps to visualise temporal patterns in crime occurrences.
Moving through the months with the slider, the density of markers appears to increase towards the end of the year, suggesting that the months close to the festive season record more crime than other months. One possible explanation is the increased social and commercial activity of the festive period, although the data alone cannot confirm this. Law enforcement may wish to give this period particular attention, and further research could examine which other factors are responsible.
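The month-by-month impression from the slider can be checked numerically by counting incidents per month. The sketch below is illustrative only: `toy_crime` is a small hypothetical data frame that borrows the `category` and `date` column names, and its values are made up. The same two steps could be applied to `colc_crime_data`, assuming its `date` column can be converted with `as.Date()`.

```r
# Hypothetical toy data standing in for colc_crime_data (values are invented)
toy_crime <- data.frame(
  category = c("burglary", "drugs", "burglary", "robbery", "drugs", "burglary"),
  date = as.Date(c("2023-01-01", "2023-01-01", "2023-02-01",
                   "2023-12-01", "2023-12-01", "2023-12-01"))
)

# Count incidents in each calendar month
monthly_counts <- table(format(toy_crime$date, "%Y-%m"))
print(monthly_counts)

# Identify the month with the most recorded incidents
busiest_month <- names(monthly_counts)[which.max(monthly_counts)]
print(busiest_month)
```

A table of monthly counts makes it easier to judge whether the end-of-year increase is large or only a visual impression from the animation.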
names(colc_crime_data)
## [1] "category" "persistent_id" "date" "lat"
## [5] "long" "street_id" "street_name" "id"
## [9] "location_type" "outcome_status"
library(dplyr)
library(ggplot2)
library(plotly) # provides plot_ly() and layout()
# Get the top ten streets
# Group the data by street name and count the crimes recorded for each street
street_crime_counts <- colc_crime_data %>%
  group_by(street_name) %>%
  summarise(total_crime_count = n()) %>%
  ungroup()
# Arrange the data in descending order of crime count and select the top 10 streets
top_ten_streets <- street_crime_counts %>%
  arrange(desc(total_crime_count)) %>%
  head(10)
# Plot an interactive bar chart of the 10 streets with the most recorded crime in Colchester
# Reuse the top 10 streets selected above
top_crime_streets <- top_ten_streets
# Create the bar chart (one bar per street, so this is a bar chart rather than a histogram)
histogram <- plot_ly(top_crime_streets, x = ~street_name, y = ~total_crime_count, type = "bar",
                     marker = list(color = ~total_crime_count,
                                   colorscale = "Viridis",
                                   line = list(color = "black")),
                     hoverinfo = "text",
                     text = ~paste("Street: ", street_name, "<br>Total Crime Count: ", total_crime_count)) %>%
  layout(title = "Top 10 Most Dangerous Streets in Colchester (2023)",
         xaxis = list(title = "Street Name"),
         yaxis = list(title = "Total Crime Count"),
         hovermode = "closest",
         showlegend = FALSE)
# Display the interactive bar chart
histogram
The interactive bar chart identifies the 10 street names with the highest total crime counts. Hovering over a bar shows the street name and its total crime count. The labels “On or near” and “On or near Shopping Area” have the highest counts (495 and 328, respectively). These are generic location descriptions rather than named streets, so they may pool incidents from several places. Even so, the high count for shopping areas suggests a concentration of crime there, possibly because of greater opportunity for theft or vandalism.
Other generic locations with high counts include “On or near Supermarket” (243), “Parking Area” (171), and “Nightclub” (142), which may attract criminal activity for similar reasons. Named streets such as “Cowdray Avenue” (164), “St Nicholas Street” (150), “Balkerne Gardens” (148), “Church Street” (144) and “George Street” (117) complete the top 10, indicating potential crime hotspots in residential or public areas.
# Create a new column for converted date
colc_crime_data <- colc_crime_data %>%
  mutate(converted_date = as.Date(date))
# Aggregate data by year-month and location
colc_crime_data_agg <- colc_crime_data %>%
  mutate(year_month = format(converted_date, "%Y-%m")) %>%
  group_by(year_month, lat, long) %>%
  summarise(crime_count = n(), .groups = "drop")
# Create a ggplot object; the text aesthetic supplies the tooltip that ggplotly() is asked for below
crime_map_gg <- ggplot(colc_crime_data_agg,
                       aes(x = long, y = lat, size = crime_count, color = crime_count,
                           text = paste("Month:", year_month, "<br>Crime count:", crime_count))) +
  geom_point(shape = 21, fill = "black") + # Filled circle: black fill, outline coloured by crime count
  scale_size_continuous(range = c(1, 5)) + # Adjust size range
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Crime Incidence over Crime Count in Colchester (2023)",
       x = "Longitude",
       y = "Latitude",
       size = "Crime Count",
       color = "Crime Count") +
  theme_minimal()
# Convert ggplot object to plotly object, showing only the text aesthetic in the tooltip
crime_map_plotly <- ggplotly(crime_map_gg, tooltip = "text")
# Display the interactive plot
crime_map_plotly
The interactive scatter plot shows where crimes were recorded, using the latitude and longitude from the dataset. Each circle marks a location, and both its size and its colour correspond to the number of crimes recorded there. The counts are aggregated by month, so a location can appear once for each month. There seems to be a higher concentration of crimes in the central and southern parts of Colchester, which could be due to factors such as population density, commercial activity, or the presence of certain types of establishments. Note that the colour encodes crime count, not severity: red circles mark locations with more recorded crimes and blue circles mark locations with fewer.
library(leaflet)
# Build a palette that maps each crime category to a colour ("viridis" is one possible palette choice)
category_pal <- colorFactor(palette = "viridis", domain = colc_crime_data$category)
# Create a leaflet map
crime_map_leaflet <- leaflet(data = colc_crime_data) %>%
  # Add tile layers for the base map
  addTiles() %>%
  # Add crime data as circle markers with different colors for each category
  addCircleMarkers(
    radius = 3, # Set a fixed radius for the circles
    color = ~category_pal(category), # leaflet needs colour values, not raw category names
    stroke = FALSE, # No border
    fillOpacity = 0.6, # Opacity of the fill
    popup = ~paste("Category:", category, "<br>Date:", date), # Popup information
    label = ~paste("Category:", category) # Label information
  ) %>%
  # Add scale bar
  addScaleBar(position = "bottomright") %>%
  # Set map options
  setView(lng = mean(colc_crime_data$long), lat = mean(colc_crime_data$lat), zoom = 10) # Set the initial view
## Assuming "long" and "lat" are longitude and latitude, respectively
# Display the map
crime_map_leaflet
Hovering over a point shows its crime category as a label, and clicking it opens a popup with the category and date. Different crime categories can be seen across the locations.
# read the colchester 2023 climate dataset
colc_clm23 <- read.csv("temp2023.csv")
# check the first few rows
head(colc_clm23)
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1 3590 2023-12-31 8.7 10.6 4.4 7.2
## 2 3590 2023-12-30 6.6 9.7 4.4 4.2
## 3 3590 2023-12-29 9.9 11.4 6.9 6.0
## 4 3590 2023-12-28 9.9 11.5 4.0 7.5
## 5 3590 2023-12-27 5.8 10.6 3.9 3.7
## 6 3590 2023-12-26 9.8 12.7 6.3 7.6
## HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1 89.6 S 25.0 63.0 999.0 6.2 8.0 8.0
## 2 85.5 WSW 22.7 50.0 1006.9 0.4 4.6 6.5
## 3 77.2 SW 32.8 61.2 1003.6 0.8 6.5 6.7
## 4 84.6 SSW 32.2 70.4 1003.2 2.8 6.8 7.1
## 5 86.4 SW 13.2 37.1 1016.4 2.0 4.0 6.9
## 6 86.9 WSW 23.5 46.3 1006.2 4.4 6.5 7.4
## SunD1h VisKm PreselevHp SnowDepcm
## 1 0.0 26.3 NA NA
## 2 1.1 48.3 NA NA
## 3 0.1 26.7 NA NA
## 4 0.0 25.1 NA NA
## 5 3.2 30.1 NA NA
## 6 0.0 45.8 NA NA
# show the names of the dataset column
names(colc_clm23)
## [1] "station_ID" "Date" "TemperatureCAvg" "TemperatureCMax"
## [5] "TemperatureCMin" "TdAvgC" "HrAvg" "WindkmhDir"
## [9] "WindkmhInt" "WindkmhGust" "PresslevHp" "Precmm"
## [13] "TotClOct" "lowClOct" "SunD1h" "VisKm"
## [17] "PreselevHp" "SnowDepcm"
# Check the structure of the data
str(colc_clm23)
## 'data.frame': 365 obs. of 18 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : chr "2023-12-31" "2023-12-30" "2023-12-29" "2023-12-28" ...
## $ TemperatureCAvg: num 8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
## $ TemperatureCMax: num 10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
## $ TemperatureCMin: num 4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
## $ TdAvgC : num 7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
## $ HrAvg : num 89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
## $ WindkmhDir : chr "S" "WSW" "SW" "SSW" ...
## $ WindkmhInt : num 25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
## $ WindkmhGust : num 63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
## $ PresslevHp : num 999 1007 1004 1003 1016 ...
## $ Precmm : num 6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
## $ TotClOct : num 8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
## $ lowClOct : num 8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
## $ SunD1h : num 0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
## $ VisKm : num 26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
## $ PreselevHp : logi NA NA NA NA NA NA ...
## $ SnowDepcm : int NA NA NA NA NA NA NA NA NA NA ...
The climate dataset has 365 observations and 18 variables.
# Convert Date column to correct Date format
colc_clm23$Date <- as.Date(colc_clm23$Date)
# Confirm that the date column is now formatted appropriately
head(colc_clm23$Date)
## [1] "2023-12-31" "2023-12-30" "2023-12-29" "2023-12-28" "2023-12-27"
## [6] "2023-12-26"
The Date column has been converted to the Date format.
summary(colc_clm23)
## station_ID Date TemperatureCAvg TemperatureCMax
## Min. :3590 Min. :2023-01-01 Min. :-2.60 Min. : 1.70
## 1st Qu.:3590 1st Qu.:2023-04-02 1st Qu.: 7.20 1st Qu.:10.60
## Median :3590 Median :2023-07-02 Median :10.40 Median :14.20
## Mean :3590 Mean :2023-07-02 Mean :10.92 Mean :15.13
## 3rd Qu.:3590 3rd Qu.:2023-10-01 3rd Qu.:15.80 3rd Qu.:20.00
## Max. :3590 Max. :2023-12-31 Max. :23.10 Max. :30.40
##
## TemperatureCMin TdAvgC HrAvg WindkmhDir
## Min. :-6.200 Min. :-4.400 Min. :43.10 Length:365
## 1st Qu.: 3.200 1st Qu.: 4.400 1st Qu.:75.60 Class :character
## Median : 6.300 Median : 7.600 Median :81.70 Mode :character
## Mean : 6.365 Mean : 7.578 Mean :81.25
## 3rd Qu.:10.600 3rd Qu.:11.200 3rd Qu.:87.90
## Max. :16.300 Max. :17.500 Max. :97.90
##
## WindkmhInt WindkmhGust PresslevHp Precmm
## Min. : 6.20 Min. :13.00 Min. : 967.4 Min. : 0.000
## 1st Qu.:12.00 1st Qu.:31.50 1st Qu.:1006.3 1st Qu.: 0.000
## Median :16.10 Median :38.90 Median :1014.3 Median : 0.000
## Mean :16.81 Mean :40.87 Mean :1013.6 Mean : 1.866
## 3rd Qu.:20.20 3rd Qu.:48.20 3rd Qu.:1021.7 3rd Qu.: 1.150
## Max. :37.50 Max. :98.20 Max. :1045.1 Max. :33.600
## NA's :27
## TotClOct lowClOct SunD1h VisKm
## Min. :0.000 Min. :1.800 Min. : 0.000 Min. : 3.60
## 1st Qu.:3.600 1st Qu.:5.800 1st Qu.: 1.150 1st Qu.:22.70
## Median :5.100 Median :6.700 Median : 4.700 Median :31.50
## Mean :4.988 Mean :6.443 Mean : 5.127 Mean :32.11
## 3rd Qu.:7.000 3rd Qu.:7.400 3rd Qu.: 8.050 3rd Qu.:41.50
## Max. :8.000 Max. :8.000 Max. :15.400 Max. :72.90
## NA's :13 NA's :82
## PreselevHp SnowDepcm
## Mode:logical Min. :1
## NA's:365 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
## NA's :364
The summary of the climate dataset (colc_clm23) provides an overview of the data recorded at station ID 3590 throughout 2023. It covers temperature, humidity, wind speed and direction, pressure, precipitation, cloudiness, sunshine duration, visibility, and snow depth. For example, the daily average temperature ranged from -2.60°C to 23.10°C, and the medians were 10.40°C for average temperature, 14.20°C for maximum temperature, and 6.30°C for minimum temperature. Humidity varied from 43.10% to 97.90%, with a median of 81.70%. Wind speed ranged from 6.20 km/h to 37.50 km/h, with gusts reaching up to 98.20 km/h. Sea-level pressure ranged from 967.4 hPa to 1045.1 hPa. Daily precipitation had a mean of 1.866 mm and a maximum of 33.600 mm, with 27 missing values. Total cloudiness varied from 0.000 to 8.000 octants, and low-level cloudiness had 13 missing values. Sunshine duration ranged from 0.000 to 15.400 hours, with a mean of 5.127 hours and 82 missing values. Visibility ranged from 3.60 km to 72.90 km. Snow depth was recorded on only one day (1 cm) and is missing for the other 364 days, while PreselevHp is missing for all 365 days.
# Check missing values
clm_missing_val <- colSums(is.na(colc_clm23))
clm_missing_val
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin
## 0 0 0 0 0
## TdAvgC HrAvg WindkmhDir WindkmhInt WindkmhGust
## 0 0 0 0 0
## PresslevHp Precmm TotClOct lowClOct SunD1h
## 0 27 0 13 82
## VisKm PreselevHp SnowDepcm
## 0 365 364
The result above reveals missing values in several columns of the climate dataset: Precmm has 27 missing values, lowClOct has 13, SunD1h has 82, PreselevHp has 365, and SnowDepcm has 364.
# Handling missing data
# Mean imputation for Precmm, lowClOct and SunD1h
colc_clm23$Precmm[is.na(colc_clm23$Precmm)] <- mean(colc_clm23$Precmm, na.rm = TRUE)
colc_clm23$lowClOct[is.na(colc_clm23$lowClOct)] <- mean(colc_clm23$lowClOct, na.rm = TRUE)
colc_clm23$SunD1h[is.na(colc_clm23$SunD1h)] <- mean(colc_clm23$SunD1h, na.rm = TRUE)
# Exclude PreselevHp and SnowDepcm from the dataset
colc_clm23 <- colc_clm23[, !(names(colc_clm23) %in% c("PreselevHp", "SnowDepcm"))]
# confirm missing data is handled
str(colc_clm23)
## 'data.frame': 365 obs. of 16 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : Date, format: "2023-12-31" "2023-12-30" ...
## $ TemperatureCAvg: num 8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
## $ TemperatureCMax: num 10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
## $ TemperatureCMin: num 4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
## $ TdAvgC : num 7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
## $ HrAvg : num 89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
## $ WindkmhDir : chr "S" "WSW" "SW" "SSW" ...
## $ WindkmhInt : num 25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
## $ WindkmhGust : num 63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
## $ PresslevHp : num 999 1007 1004 1003 1016 ...
## $ Precmm : num 6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
## $ TotClOct : num 8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
## $ lowClOct : num 8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
## $ SunD1h : num 0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
## $ VisKm : num 26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
Precmm, lowClOct and SunD1h have 27, 13 and 82 missing values, respectively, so mean imputation was applied to replace them. This is a common approach to handling missing data, and it assumes that the values are missing completely at random or missing at random. PreselevHp and SnowDepcm (365 and 364 missing values, respectively) were dropped from the dataset because all or nearly all of their values are missing. After these steps no missing values remain and the two columns have been dropped, leaving 16 variables, so further analysis can now be carried out.
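The effect of mean imputation can be illustrated on a small scale. The sketch below uses a hypothetical vector, `toy_precip`, whose values are invented for illustration; it fills the gaps with the observed mean, confirms with `anyNA()` that nothing is missing afterwards, and compares the spread before and after. The same `anyNA()` check could be run on `colc_clm23`.

```r
# Hypothetical toy vector with missing values (illustration only)
toy_precip <- c(0, 2.4, NA, 0.8, NA, 6.2)

# Mean of the observed (non-missing) values
observed_mean <- mean(toy_precip, na.rm = TRUE)

# Replace each missing value with the observed mean
imputed_precip <- toy_precip
imputed_precip[is.na(imputed_precip)] <- observed_mean

# Confirm that no missing values remain
print(anyNA(imputed_precip))

# Mean imputation leaves the mean unchanged but reduces the standard deviation
sd_before <- sd(toy_precip, na.rm = TRUE)
sd_after <- sd(imputed_precip)
print(c(sd_before = sd_before, sd_after = sd_after))
```

Because imputed values all sit at the mean, the variable looks less variable than it really is. This matters most for SunD1h, where 82 of 365 values were filled in, so results involving that variable should be read with some caution.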
# Load libraries
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Create time series plots to visualize the trends and variations of different climate variables over time
# Temperature variation plot with smoothing
temperature_plot <- ggplot(colc_clm23, aes(x = Date)) +
  geom_line(aes(y = TemperatureCAvg, color = "Average Temperature")) +
  geom_line(aes(y = TemperatureCMax, color = "Max Temperature")) +
  geom_line(aes(y = TemperatureCMin, color = "Min Temperature")) +
  geom_smooth(aes(y = TemperatureCAvg), method = "loess", color = "red", se = FALSE) + # Smoothing
  geom_smooth(aes(y = TemperatureCMax), method = "loess", color = "blue", se = FALSE) + # Smoothing
  geom_smooth(aes(y = TemperatureCMin), method = "loess", color = "green", se = FALSE) + # Smoothing
  labs(title = "Temperature Variation",
       x = "Date",
       y = "Temperature (°C)",
       color = "Type") +
  theme_minimal() + # Minimal theme
  theme(panel.grid = element_blank()) # Remove gridlines
# Visualize precipitation data with smoothing
precipitation_plot <- ggplot(colc_clm23, aes(x = Date, y = Precmm)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_smooth(aes(y = Precmm), method = "loess", color = "red", se = FALSE) + # Smoothing
  labs(title = "Daily Precipitation",
       x = "Date",
       y = "Precipitation (mm)") +
  theme_minimal() + # Minimal theme
  theme(panel.grid = element_blank()) # Remove gridlines
# Visualize wind speed data with smoothing
wind_speed_plot <- ggplot(colc_clm23, aes(x = Date, y = WindkmhInt)) +
  geom_line(color = "green") +
  geom_smooth(aes(y = WindkmhInt), method = "loess", color = "blue", se = FALSE) + # Smoothing
  labs(title = "Wind Speed Variation",
       x = "Date",
       y = "Wind Speed (km/h)") +
  theme_minimal() + # Minimal theme
  theme(panel.grid = element_blank()) # Remove gridlines
# Arrange plots in a 3x1 grid
grid.arrange(temperature_plot, precipitation_plot, wind_speed_plot, nrow = 3, ncol = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The time series plots above show the variation of three groups of climate variables in Colchester during 2023: temperature (average, maximum and minimum, in °C), daily precipitation (mm) and wind speed (km/h). The temperature and wind speed plots use lines to show each variable over time, while the precipitation plot uses bars to represent the amount of precipitation each day. A smoothed line is overlaid on every series to highlight the general variation without the high-frequency fluctuations.
Temperature follows a seasonal pattern, with higher values in the summer months and lower values in the winter months. The smoothed lines rise from winter into summer and fall again towards December. With only one year of data, this reflects the seasonal cycle and cannot be read as a long-term warming trend. Precipitation varies throughout the year, with some periods, such as late July, early August and early November, receiving more rain than other months. Wind speed also varies throughout the year, with higher speeds at times in January and May and potentially higher speeds in winter overall.
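The seasonal pattern described above can also be summarised as monthly means, which are easier to compare than a smoothed curve. The sketch below is only an illustration: `toy_clm` is a hypothetical data frame with invented values that reuses the `Date` and `TemperatureCAvg` column names, and the same two steps could be applied to `colc_clm23`.

```r
# Hypothetical toy data with the same column names as colc_clm23 (values are invented)
toy_clm <- data.frame(
  Date = as.Date(c("2023-01-10", "2023-01-20", "2023-07-10",
                   "2023-07-20", "2023-12-10", "2023-12-20")),
  TemperatureCAvg = c(3.0, 5.0, 18.0, 20.0, 7.0, 9.0)
)

# Extract the month number from each date
toy_clm$month <- as.integer(format(toy_clm$Date, "%m"))

# Average the daily mean temperature within each month
monthly_temp <- aggregate(TemperatureCAvg ~ month, data = toy_clm, FUN = mean)
print(monthly_temp)
```

If the monthly means rise to a summer peak and then fall, the pattern is seasonal; a year-on-year comparison would be needed before speaking of a trend.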
# Plot time series of TemperatureCMax and TemperatureCMin
temperature_time_series <- ggplot(colc_clm23, aes(Date)) +
geom_smooth(aes(y = TemperatureCMax, color = "TemperatureCMax"), method = "loess", se = FALSE) + # Add Smoothed Trend Lines
geom_smooth(aes(y = TemperatureCMin, color = "TemperatureCMin"), method = "loess", se = FALSE) + # Add Smoothed Trend Lines
labs(x = "Date", y = "Temperature (°C)", color = "Variable") +
theme_minimal() +
scale_color_manual(values = c("TemperatureCMax" = "red", "TemperatureCMin" = "blue")) +
ggtitle("Time Series of Maximum and Minimum Temperatures") +
theme(plot.title = element_text(size = 12))
# Plot time series of Precmm, SunD1h, WindkmhInt and TotClOct
weather_time_series <- ggplot(colc_clm23, aes(Date)) +
geom_line(aes(y = Precmm, color = "Precipitation (mm)"), linetype = "dashed") +
geom_line(aes(y = SunD1h, color = "Sunshine Duration (hours)"), linetype = "dotted") +
geom_line(aes(y = WindkmhInt, color = "Wind Speed (km/h)")) +
geom_line(aes(y = TotClOct, color = "Total Cloudiness (octants)")) +
labs(x = "Date", y = "Value", color = "Variable") +
theme_minimal() +
scale_color_manual(values = c("Precipitation (mm)" = "green", "Sunshine Duration (hours)" = "orange", "Wind Speed (km/h)" = "purple", "Total Cloudiness (octants)" = "blue")) +
ggtitle("Time Series of Precipitation, Sunshine Duration, Wind Speed, and Total Cloudiness") +
theme(plot.title = element_text(size = 12))
# Display the plots one after the other
# (par(mfrow = c(2, 1)) only affects base graphics and has no effect on ggplot objects, so it is omitted)
print(temperature_time_series)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
print(weather_time_series)
The plots above show time series for the maximum and minimum temperatures and for four weather variables in Colchester during 2023. Smoothing helps to highlight the general trends without the distraction of high-frequency fluctuations in the data, though it may obscure some of the short-term variations in temperature.
For the temperature plot, the smoothed lines show a clear seasonal pattern, with high temperatures in the summer months and low temperatures in the winter months. This is typical of temperate climates in the Northern Hemisphere. The difference between the maximum and minimum temperature appears to be larger in the summer months than in the winter months, which suggests that the daily temperature range in Colchester is wider in summer and narrower in winter.
For the weather variables, precipitation varies throughout the year, with potentially higher amounts in some periods, such as October, than in others. Sunshine duration fluctuates throughout the year, with the higher peaks suggesting longer durations in the summer months. Wind speed also varies throughout the year, with possibly higher speeds during parts of the winter. The total cloudiness line tends to be higher in periods with lower sunshine duration (the orange line). This suggests a possible link between cloud cover and sunshine hours, as expected.
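The suggested link between cloud cover and sunshine could be quantified with a correlation coefficient. The sketch below is a minimal illustration on a hypothetical data frame, `toy_clm`, with invented values and the same column names as the climate data. It uses Spearman's rank correlation, which is a choice made here rather than something taken from the analysis, and drops incomplete pairs with `use = "complete.obs"`.

```r
# Hypothetical toy data reusing the TotClOct and SunD1h column names (values are invented)
toy_clm <- data.frame(
  TotClOct = c(8.0, 6.5, 4.0, 2.0, 7.5, 1.0),
  SunD1h   = c(0.0, 1.5, 6.0, 10.2, NA, 12.4)
)

# Rank correlation between cloud cover and sunshine duration,
# using only the rows where both values are present
cloud_sun_cor <- cor(toy_clm$TotClOct, toy_clm$SunD1h,
                     use = "complete.obs", method = "spearman")
print(cloud_sun_cor)
```

A clearly negative coefficient would support the visual impression that cloudier days have fewer sunshine hours. On the real data it may be preferable to compute this before imputing SunD1h, so that the filled-in values do not weaken the relationship.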
# Define function to extract season from date
season <- function(date) {
months <- as.numeric(format(date, "%m"))
ifelse(months %in% 3:5, "Spring",
ifelse(months %in% 6:8, "Summer",
ifelse(months %in% 9:11, "Autumn", "Winter")))
}
# Create boxplot for cloud cover by season
ggplot(colc_clm23, aes(x = factor(season(Date)), y = TotClOct)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(title = "Distribution of Total Cloud Cover by Season",
       x = "Season",
       y = "Total Cloud Cover (Octants)") +
  theme_minimal() +
  ylim(0, quantile(colc_clm23$TotClOct, 1)) # Set the y-axis from 0 to the maximum observed value
The boxplot visualises the distribution of total cloud cover (TotClOct) across seasons in Colchester, showing the spread of values for each season (Autumn, Spring, Summer, Winter). The horizontal line within each box represents the median TotClOct value for that season. The boxes indicate that the distribution of TotClOct varies across seasons.
The box for Spring has the widest spread, implying high variability in TotClOct values during this season. The Summer box appears narrower than the Spring box, indicating less variation in TotClOct during the summer months. The Autumn and Winter boxes have similar spreads. The whiskers extending from the boxes cover the TotClOct values within 1.5 times the interquartile range (IQR) of the quartiles. Any data points beyond the whiskers are treated as outliers and plotted as individual points.
The boxplot alone does not give exact values for the season with the highest or lowest total cloud cover, but comparing the positions of the median lines gives an indication. Spring’s median appears to be the highest, followed by Winter and Autumn, while Summer appears to have the lowest median TotClOct. Spring also seems to have the most variable cloud cover, while Summer shows a more consistent pattern.
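Reading medians off a boxplot is approximate, so the seasonal medians and interquartile ranges can be computed directly. The sketch below repeats the `season()` helper from the code above and applies it to a hypothetical data frame, `toy_clm`, whose values are invented. Replacing `toy_clm` with `colc_clm23` would give the actual figures.

```r
# Season helper, as defined in the analysis above
season <- function(date) {
  months <- as.numeric(format(date, "%m"))
  ifelse(months %in% 3:5, "Spring",
         ifelse(months %in% 6:8, "Summer",
                ifelse(months %in% 9:11, "Autumn", "Winter")))
}

# Hypothetical toy data reusing the Date and TotClOct column names (values are invented)
toy_clm <- data.frame(
  Date = as.Date(c("2023-01-15", "2023-02-15", "2023-04-15", "2023-05-15",
                   "2023-07-15", "2023-08-15", "2023-10-15", "2023-11-15")),
  TotClOct = c(6.0, 7.0, 5.0, 7.0, 3.0, 4.0, 5.5, 6.5)
)

# Median and interquartile range of cloud cover within each season
seasonal_median <- tapply(toy_clm$TotClOct, season(toy_clm$Date), median)
seasonal_iqr <- tapply(toy_clm$TotClOct, season(toy_clm$Date), IQR)
print(seasonal_median)
print(seasonal_iqr)
```

The medians answer which season is cloudiest on a typical day, and the IQRs answer which season is most variable, the two questions the boxplot discussion raises.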
The analysis and visualisations revealed several interesting features in both datasets. However, both datasets cover only a single year, so it is difficult to draw conclusions about long-term trends. Data from additional years would show whether the patterns observed in 2023 are consistent with historical trends or represent an anomaly.
https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/
https://sape.inf.usi.ch/quick-reference/ggplot2/colour#:~:text=Red%20Green%20Blue%20(RGB)%20Colour,of%20%5B0%2C%201%5D).
MA304 Lecture slides and lab solutions.